Detecting Complex Dependencies in Categorical Data

نویسندگان

  • Tim Oates
  • Matthew D. Schmill
  • Dawn E. Gregory
  • Paul R. Cohen
چکیده

Locating and evaluating relationships among values in multiple streams of data is a di cult and important task. Consider the data owing from monitors in an intensive care unit. Readings from various subsets of the monitors are indicative and predictive of certain aspects of the patient's state. We present an algorithm that facilitates discovery and assessment of the strength of such predictive relationships called Multi-stream Dependency Detection (msdd). We use heuristic search to guide our exploration of the space of potentially interesting dependencies to uncover those that are signi cant. We begin by reviewing the dependency detection technique described in [3], and extend it to the multiple stream case, describing in detail our heuristic search over the space of possible dependencies. Quantitative evidence for the utility of our approach is provided through a series of experiments with arti cially-generated data. In addition, we present results from the application of our algorithm to two real problem domains: feature-based classi cation and prediction of pathologies in a simulated shipping network. 1. Dependency Detection A dependency is an unexpectedly frequent or infrequent co-occurrence of events over time. Our goal is to nd dependencies between tokens contained in multiple streams. A stream is sequence of values produced over time, and a token is one of the nite set of values that a stream can produce. Dependencies across multiple streams may take many forms: perhaps token a in stream 1 predicts token b in stream 2, or perhaps token a in stream 1 and token c in stream 2 predict token b in stream 2. In general, if stream j contains tj distinct tokens, there are [ Qn j=1 tj + 1] 2 possible dependencies between two items. The dependency detection technique in [3] uses contingency tables to assess the signi cance of dependencies in a single stream of data. Let (tp; ts; ) denote a dependency. Each dependency rule states that when the precursor token, tp, occurs at time step i in the stream, the successor token, ts, will occur at time step i + in the stream with some probability. When this probability is high, the dependency is strong. Consider the stream acbabaccbaabacbbacba. Of all 19 pairs of tokens at lag 1 (e.g. ac, cb, ba, : : : ) 7 pairs have b as the precursor; 6 of those have a as the successor, and one has something other than a (denoted a), as the successor. The following contingency table represents this information: Table(b,a,1) = a a total

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Distributed Approach to Finding Complex Dependencies in Data

Learning complex dependencies from time series data is an important task; dependencies can be used to make predictions and characterize a source of data. We have developed Multi-Stream Dependency Detection (msdd), a machine learning algorithm that detects complex dependencies in categorical time-series data. dmsdd attempts to balance the search for strong dependencies across a heterogeneous net...

متن کامل

Nonparametric Bayesian Multiple Imputation for Incomplete Categorical Variables in Large-Scale Assessment Surveys

In many surveys, the data comprise a large number of categorical variables that suffer from item nonresponse. Standard methods for multiple imputation, like log-linear models or sequential regression imputation, can fail to capture complex dependencies and can be difficult to implement effectively in high dimensions. We present a fully Bayesian, joint modeling approach to multiple imputation fo...

متن کامل

A Generalized Markov Chain Approach for Conditional Simulation of Categorical Variables from Grid Samples

Complex categorical variables are usually classified into many classes with interclass dependencies, which conventional geostatistical methods have difficulties to incorporate. A two-dimensional Markov chain approach has emerged recently for conditional simulation of categorical variables on line data, with the advantage of incorporating interclass dependencies. This paper extends the approach ...

متن کامل

Tools for detecting dependencies in AI systems

We present a methodology for learning complex dependencies in data based on streams of categorical, time series data. The streams representation is applicable in a variety of situations: a program's execution trace may be thought of as a stream. The various monitor readings of an intensive care unit may be thought of as concurrent streams. Our learning methodology, called dependency detection, ...

متن کامل

Utilizing mutual information for detecting rare and common variants associated with a categorical trait

Background. Genome-wide association studies have succeeded in detecting novel common variants which associate with complex diseases. As a result of the fast changes in next generation sequencing technology, a large number of sequencing data are generated, which offers great opportunities to identify rare variants that could explain a larger proportion of missing heritability. Many effective and...

متن کامل

Searching for Structure in Multiple Streams of Data

Finding structure in multiple streams of data is an important problem. Consider the streams of data owing from a robot's sensors, the monitors in an intensive care unit, or periodic measurements of various indicators of the health of the economy. There is clearly utility in determining how current and past values in those streams are related to future values. We formulate the problem of nding s...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1995